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Abstract 

In this note we derive a simple Bayesian sampler for linear regression with the horseshoe 
hierarchy. A new interpretation of the horseshoe model is presented, and extensions to logistic 
regression, negative-binomial regression and alternative prior hierarchies, such as horseshoe-)-, are 
discussed. Due to the conjugacy of the proposed hierarchy, Chib’s algorithm may be used to easily 
compute the marginal likelihood of the model. 


1 Introduction 


Consider the following Bayesian linear regression 

y|X, /3, a 2 ~ 

model for data y £ R n 

A/”n(X/3, cr 2 l n ), 

(1) 

n | \ 2 2 2 

, <y ~ 

A/”(0, A jT 2 a 2 ), 

(2) 

2 

a ~ 

-2,2 
a a<7 , 

(3) 

A i - 

C+(0,1), 

(4) 

T 

C+(0,1) 

(5) 


where X £ R nXp is a matrix of predictor variables (not necessarily full rank), A/*,(•,•) is the k- 
variate Gaussian distribution, C + (0,1) is the standard half-Cauchy distribution with probability 
density function 


p(z) 


2 

7r( 1 + Z 2 ) ’ 


2 > 0 , 


( 6 ) 


and j = ( 1 ,... ,p). It is usual to require that the p predictors are standardized to have zero mean 
and unit length and that the data y is centred. This avoids the need to explicitly model a separate 
parameter for the intercept. 

Equations |l]j5| define the horseshoe regression hierarchy recently proposed in [T]. The horse¬ 
shoe model is a global-local shrinkage procedure in which the local shrinkage for coefficient /3j 
is determined by A j > 0 and the overall level of shrinkage is determined by the hyperparameter 
r > 0. The particular choice of a half-Cauchy prior distribution over the global and local hyperpa¬ 
rameters results in aggressive shrinkage of small coefficients (i.e., noise) and virtually no shrinkage 
of sufficiently large coefficients (i.e., signal). This is in contrast to the well known Bayesian lasso [21 
and Bayesian ridge hierarchies where the shrinkage effect is uniform across all coefficients. Further 
favourable properties of the horseshoe model are discussed in PUS]. 

The original horseshoe paper does not provide details for efficient sampling from the poste¬ 
rior distribution of the regression coefficients. A standard Gibbs sampling approach is difficult to 
implement due to the non-standard form of the conditional posterior distributions for the hyper¬ 
parameters (Ai,..., A p ) and t. Subsequent papers have suggested the use specialised algorithms, 
such as slice sampling [5], for the hyperparameters [6]. In this paper, we provide an alternative 
sampling scheme for all model parameters based on auxiliary variables that leads to conjugate 
conditional posterior distributions for all parameters, making the application of Gibbs sampling 
relatively straightforward. 
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2 Bayesian horseshoe with auxiliary variables 

We make use of the following scale mixture representation of the half-Cauchy distribution. Let x 
and a be random variables such that 


x 2 \a ~ IQ(l/2, 1/a) and a ~ XC/(l/2,1/A 2 ); 


(7) 


then x ~ C + (0, A) [7], where is the inverse-gamma distribution with probability density 

function 

P(z\a,p) = j^y 2 "“ _1 exp . (8) 

Using the decomposition |7]) leads to the revised horseshoe hierarchy 


yjx, /3, a 2 ~ 

Pjiy,T ,cr ~ 
2 

a ~ 



Fl j■ ■ ■ , Vp , £ 


A/” n (X/3, er 2 I n ), 

A/”(0, \ 2 r 2 a 2 ), 

- 2,2 
(j a<7 , 

xg(i/2,i/ Vj ), 
ig{ 1/2,1/e), 

X0(l/2,1). 


The above hierarchy makes Gibbs sampling from the posterior distribution straightforward. The 
conditional posterior distribution of the regression coefficients (3 £ R p j8] is 

/3|- ~ Vp(A- 1 X T y,o- 2 A^ 1 ), A = (X T X +A/ 1 ), A* = t 2 A, (9) 

where A = diag(A 2 ,..., Ap). The conditional posterior distribution of <r 2 is an inverse-gamma 
distribution given by 

a 2 |. ~ ig ((n + p)/ 2, (y - X/3) T (y - X/3)/2 + /3 T A“ 1 /3/2) . (10) 


The conditional posterior distributions for the local and global hypervariances are also of inverse- 
gamma type 



ig 

ig 


( 1; Vi + 2t 4 2 ) ’ 1,2, ‘" ,p )’ 

/ p + i 1 , 

V 2 


( 11 ) 


Finally, the conditional posterior distributions for the auxiliary variables are: 


Cl- 


ig(i,i + ^ J, (j = i,2,... ,p), 

ig (1,1 + ^). 


It is interesting to note that the conditional posterior distributions for all parameters, except the 
regression coefficients, are inverse-Gamma, for which efficient samplers exist. 


2.1 Horseshoe logistic regression 

The Gibbs sampling approach proposed in this paper can be extended to other models and other 
prior distributions. In logistic regression, equation Q in the hierarchy becomes 

J/i|xi,/3 ~ Binom(l, 1/(1 + e - ^)), (12) 
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where yi £ {0,1} and ipi = x T /3 are the log-odds of success. 

Bayesian logistic regression with the horseshoe hierarchy may be implemented using a scale 
mixture based on ^-distributions [9] or the Polya-gamma data augmentation framework |10] for 
modelling the logistic function at the top level of the hierarchy. Using the Polya-gamma approach, 
the conditional posterior distribution of /3 given auxiliary variables u> = (wi,..., co n ) T and data yi 
(* = 1,..., n) is a multivariate Gaussian given by 

/3|cu,y oc np((3) exp f-^(z - X/3) T fi(z - X/3)^j , (13) 

where z = (ki/wi, ..., K n /uj n ) T , Ki = yi — 1/2, fl = diag(uj) and Ttp{0) is a prior distribution for 
the regression coefficients. The conditional posterior distribution for the auxiliary variables uj is 

Wi |/3~PG(l,x?73), (14) 

which is a Polya-gamma distribution with shape parameter 1 and scale parameter x.J/3. With the 
horseshoe prior, the conditional posterior distribution for the regression coefficients becomes 

/3|-~ W p (A _1 X T nz, A' 1 ), A = (X t QX + A" 1 ), A» = r 2 A (15) 

where A = diag(Af,..., A p). The conditional posterior distributions for the prior hyperparameters 
remain unchanged. Unlike in the case of linear regression, the intercept parameter must now be 
explicitly modelled. Special care should be taken to ensure that the intercept parameter is not 
penalized (e.g., by using a uniform prior distribution). 


2.2 Horseshoe negative-binomial regression 

The Polya-gamma data augmentation strategy may also be used to derive a Bayesian horseshoe 
estimator of negative-binomial regression for count data m- Data yt (i = 1,2, ...,n) is now 
assumed to be generated by the negative binomial distribution 

p(yi\h,m) oc (1 - m) h n (ft > 0) (16) 

where nt = exp('i/>i)/(l + exp(^i)) and ipi = x T /3. This is equivalent to the sampling model 

z|X,/3 ~./V„(X/3,f2 -1 ) (17) 


where Zi = (yt — h )/2 and fl = diag(wi,CJ 2 , • ■ •, u}„). The conditional posterior distribution for the 
auxiliary variables w is now 

0Ji\f3 ~ PG(yi + h,x?f3), (18) 


which is a Polya-gamma distribution with shape parameter (yi + h) and scale parameter x//3. 
The conditional posterior distribution of the regression coefficients (3 with the horseshoe prior is 
equivalent to the horseshoe logistic regression model (151. As before, the intercept parameter must 
be explicitly modelled and should not be subject to any shrinkage. 


2.3 Alternative Prior Distributions 

Extensions of the hierarchy 0 0 to other prior distributions, such as the horseshoe-1- El , are 
straightforward and involve a direct application of the auxiliary variable representation. In the 
horseshoe+ hierarchy, the local shrinkage parameters A j are given the half-Cauchy prior distribu¬ 
tion 

Aj ~ C + (0, iy) (19) 

where j = (1,2,... ,p) and r/j is a further half-Cauchy mixing variable which is given the prior 
density 

r 7 ?~C+(0,1). (20) 
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Sample size (n) 


Number of predictors ( p ) 



10 

50 

100 

500 

1000 


10 

0.44 

0.34 

0.43 

1.94 

5.96 


50 

0.21 

0.28 

0.45 

2.31 

6.76 

bhs 

100 

0.22 

0.29 

0.57 

2.99 

7.50 


500 

0.24 

0.44 

0.65 

8.36 

32.46 


1000 

0.30 

0.53 

0.83 

10.05 

34.75 


10 

0.01 

0.23 

1.49 

165.21 

1470.74 


50 

0.02 

0.25 

1.51 

165.06 

1480.70 

monomvn 

100 

0.02 

0.24 

1.48 

165.29 

1474.30 


500 

0.05 

0.31 

1.57 

165.49 

1488.92 


1000 

0.09 

0.41 

1.73 

166.27 

1481.17 


Table 1: Comparison of run times between the R package monomvn and our implementation (bhs) of the 
Bayesian horseshoe sampler. In each test, the samplers generated 1,000 samples from the posterior distribution 
of the Bayesian horseshoe. All timings are given in seconds. 


The conditional posterior distribution of the local-shrinkage coefficients 
the case of the horseshoe-|- estimator. The conditional posterior density 
Vj is now 


A j is equivalent to (111 in 
for the auxiliary variables 


Uj\- ~TQ ^) . 

Sampling of the auxiliary mixing variables r/j is then 
position 0 - 


O' = 1,2,...,p), 

a straightforward 


( 21 ) 

application of the decom- 


3 Evaluation 

We have conducted a simulation experiment to compare run times between our implementation 
(bhs) and the R package monomvn implementation (available on the CRAN repository) of the 
Bayesian horseshoe estimator. This comparison is expected to be somewhat biased as the two 
packages are implemented in different programming languages — our implementation is in pure 
MATLAB code, while the monomvn package is implemented as an R interface to compiled C code. 
The execution times of both procedures were compared across a range of values for sample size 
and the number of predictors. For each timing run, both procedures were timed generating 1, 000 
samples from the Bayesian horseshoe posterior distribution and all timing results were measured 
in seconds. The results of the timing experiment are given in Table [I] For small sample sizes and 
small numbers of predictors, the monomvn implementation is faster than bhs, with both procedures 
finishing in less than one second of execution time. However, as the number of predictors grows our 
implementation is significantly faster than monomvn. As an example, for the experiment (n = 1, 000, 
p = 1,000), our implementation is approximately 40 times faster than monomvn. 

The monomvn package uses the conventional approach to sampling from the posterior distri¬ 
bution of the horseshoe estimator by slice sampling the hyperparameters r and (Ai,..., A p ). We 
compared our sampling procedure based on auxiliary variables (bhs) against the monomvn imple¬ 
mentation using the diabetes data set examined in [2|. This dataset consists of 442 observations 
and 10 predictors, some of which are highly correlated. 

The sampling efficiency of the two procedures was compared using the effective sample size 
metric as discussed in ]TJ[. The effective sample size for each regression coefficient was estimated, 
under varying levels of thinning, from a chain of 50 million posterior samples generated by both 
bhs and monomvn. For all regression coefficients and under all levels of thinning, the bhs procedure 
was found to have a higher effective sample size. As an example, Figure [l] depicts the effective 
sample size for predictors S2 and S3 as a function of the level of thinning. These two predictors had 
the smallest observed effective sample size of the ten predictors in the diabetes data set. The bhs 
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Thinning 

Figure 1: Comparison of sampling efficiency between monomvn and bhs in terms of effective sample 
size (expressed as a proportion) 


procedure is clearly more efficient than monomvn and the difference appears greatest for thinning 
levels of four to eight. In order to achieve an effective sample size of 80% for all ten predictors, the 
monomvn sampler required a thinning level of 16; in contrast, the bhs sampler required a thinning 
level of 11 to achieve the same sampling efficiency. Despite the introduction of latent variables, 
the bhs algorithm appears to be more efficient in terms of effective sample size than the monomvn 
procedure which is based on slice sampling with no auxiliary variables. This may be due to the slice 
sampler suffering from numerical issues driven by the heavy tails of the Cauchy prior distributions. 


4 Discussion 

The decomposition 0 reveals an interesting novel interpretation of the horseshoe hierarchy. Inte¬ 
grating out the hypervariances (A?,..., Xp) implies a Cauchy prior distribution over each regression 
coefficient /3j of the form 

ftlrW,- ~C(0,2 tct/(2z/,.) 1/2 ). (22) 

The horseshoe model can therefore be viewed as placing a Cauchy prior distribution over each 
regression coefficient fij with the scale of each prior inversely proportional to Vj. 

Gibbs sampling may also be used in the case where the global scale parameter a > 0 is given 
a half-Cauchy prior distribution, as recommended in [13], by using the decomposition |t|. An 
interesting consequence of the conjugacy of all conditional posterior distributions is that Chib’s 
algorithm m for computing the marginal likelihood from the output of a Gibbs sampler is readily 
applicable as the normalizing constants for all conditionals are known. This is in contrast to 
sampling methods that use algorithms such as slice sampling. 

Our MATLAB™ implementation of Bayesian horseshoe linear regression is available from the 
MATLAB Central File Exchange repository (File ID #52479). The implementation uses Rue’s 
algorithm m for efficient sampling from the multivariate Gaussian conditional posterior distri¬ 
bution of the regression coefficients when the sample size is greater than the number of predictors 
(n > p). This algorithm is based on Cholesky factorization with cubic order of complexity in 
terms of the number of predictors p. For the case where the number of predictors is large, this 
approach for sampling from multivariate Gaussian distributions is computationally inefficient. In 
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this setting, our implementation employs the sampling algorithm given in m which has linear 
complexity in terms of the number of predictors. 
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